BRIEF ON APPEAL



Sir: 



Further to the Notice of Appeal filed July 1, 2003, and received at the Patent Office on 
July 3, 2003, herewith are three copies of Appellants' Brief on Appeal. Authorized fees include 
the $320 fee for the filing of this Brief.

This is an appeal from the decision of the Examiner finally rejecting claims 46, 48, 49,
51, 53-60, and 66-68 of the above-identified application.



(1) REAL PARTY IN INTEREST
The above-identified application is assigned of record to Incyte Pharmaceuticals, Inc.
(now Incyte Corporation), (Reel 9851, Frames 0199 and 0206) who is the real party in interest
herein.

(2) RELATED APPEALS AND INTERFERENCES
Appellants, their legal representative and the assignee are not aware of any related appeals or 
interferences which will directly affect or be directly affected by or have a bearing on the Board's 
decision in the instant appeal. 



(3) STATUS OF THE CLAIMS 
Claims rejected: Claims 46, 48, 49, 51, 53-60, and 66-68
Claims allowed: Claim 65 
Claims canceled: Claims 1-44, 50, 52, 63, and 64 
Claims withdrawn: Claims 45, 47, 61, and 62 

Claims on Appeal: Claims 46, 48, 49, 51, 53-60, and 66-68 (A copy of the claims on
appeal, as amended, can be found in the attached Appendix.)



(4) STATUS OF AMENDMENTS AFTER FINAL
No amendments were submitted after Final Rejection. 



(5) SUMMARY OF THE INVENTION 
Appellants' invention is directed to antibodies which specifically bind to polypeptides, including 
nucleotide pyrophosphohydrolase NTPPH-2, comprising the amino acid sequence of SEQ ID NO: 1
(Specification, e.g., at page 3, lines 17-19; page 4, lines 27-29; page 9, lines 15-17; page 30, lines 16-22;
and page 31, lines 7-13). Appellants' invention also includes antibodies which specifically bind to
polypeptides at least 90% identical to SEQ ID NO: 1 (e.g., at page 17, lines 1-5), or to polypeptides
which comprise fragments of SEQ ID NO: 1 (e.g., at page 4, lines 27-29; page 9, lines 15-21; page 30,
lines 23-25; and page 31, lines 7-13). The invention further includes compositions comprising the
foregoing antibodies (e.g., at page 34, line 30 to page 35, line 4), and methods of making the foregoing
antibodies (e.g., at page 30, line 15 to page 32, line 15; and page 55, line 28 to page 56, line 12).

NTPPH-2 has strong chemical and structural homology with a human nucleotide
pyrophosphohydrolase, NTPPH-1 (Incyte ID 422069; SEQ ID NO: 3) (Specification, e.g., at page 16,
lines 15-16). In particular, NTPPH-2 and NTPPH-1 share 50% sequence identity (e.g., at page 16, 
lines 16-17; and Figures 2A, 2B, and 2C). In addition:

"NTPPH-2 is 1156 amino acids in length (Figures 1A-1K) and has three potential
N-glycosylation sites at N276, N308, N329, 25 potential phosphorylation sites at T24, S135,
[?]229, T245, S267, S325, T331, T372, S427, S434, S439, T517, T523, Y599, T605, [?]630, T750, T847,
S883, Y909, S977, S1017, T1063, S1068, and T1149. . . As illustrated by Figures 3A and 3B,
NTPPH-2 and NTPPH-1 have similar hydrophobicity plots and both show a 
hydrophobic signal sequence. The predicted isoelectric points for NTPPH-2 and 
NTPPH-1 are 8.07 and 8.21, respectively. Membrane-based northern analysis 
showed the highest level of NTPPH-2 mRNA expression in cartilage and lower, but 
significant, expression in testes, trachea, and bone marrow. Electronic northern analysis
shows the expression of this sequence in various libraries, at least 57% of which involve
immunological response and many of which are cartilage or joint related and at least
26% of which involve immortalized or cancerous cells and tissues. Of particular note is
the expression of NTPPH-2 in rheumatoid and osteoarthritic synovial, chondrocyte,
and tibial libraries." (Specification at page 16, lines 12-30) 

The antibodies of the present invention are useful, for example, for purifying and detecting 
polypeptides which have specific uses in toxicology testing, drug discovery, and disease diagnosis 
(Specification, e.g., at page 26, line 25 to page 27, line 3; page 38, line 14 to page 39, line 5; and page 
46, lines 27-30). 

(6) ISSUES 

1. Whether claims 46, 48, 49, 51, 53-60, and 66-68 meet the written description requirement 
of 35 U.S.C. § 112, first paragraph. 

2. Whether claims 46, 48, 49, 51, 53-60, and 66-68 meet the enablement requirement of 35 
U.S.C. § 112, first paragraph. 



3. Whether claims 46, 48, 49, 51, 53-60, and 66 meet the requirements of 35 U.S.C. § 112, 
second paragraph.

(7) GROUPING OF THE CLAIMS

As to Issue 1 

Claims 46, 48, 49, 51, 53-60, and 66-68 are grouped together. 
As to Issue 2 

Claims 46, 48, 49, 51, 53-60, and 66-68 are grouped together. 
As to Issue 3 

Claims 46, 48, 49, 51, 53-60, and 66 are grouped together. 



(8) APPELLANTS' ARGUMENTS 

Issue 1 - Whether claims 46, 48, 49, 51, 53-60, and 66-68 meet the written description
requirement of 35 U.S.C. § 112, first paragraph

Claims 46, 48, 49, 51, 53-60, and 66-68 stand rejected under 35 U.S.C. § 112, first
paragraph, based on the allegation that the specification does not describe the subject matter in such a
way as to reasonably convey to one of skill in the art that the inventors, at the time the application was
filed, had possession of the claimed invention. The Examiner asserts that "the disclosure of SEQ ID
NO: 1 does not define the structural basis for the asserted and/or recited functional attributes of
antibodies that bind the generically recited fragments and variants of SEQ ID NO: 1" (Office Action,
April 7, 2003; page 3). This rejection is traversed. 

The requirements necessary to fulfill the written description requirement of 35 U.S.C. § 112,
first paragraph, are well established by case law.

. . . the applicant must also convey with reasonable clarity to those skilled in the art that, 
as of the filing date sought, he or she was in possession of the invention. The invention 
is, for purposes of the "written description" inquiry, whatever is now claimed. 
Vas-Cath, Inc. v. Mahurkar, 19 USPQ2d 1111, 1117 (Fed. Cir. 1991)

Attention is also drawn to the Patent and Trademark Office's own "Guidelines for Examination
of Patent Applications Under the 35 U.S.C. Sec. 112, para. 1," published January 5, 2001, which
provide that: 

An applicant may also show that an invention is complete by disclosure of sufficiently
detailed, relevant identifying characteristics which provide evidence that applicant was
in possession of the claimed invention, i.e., complete or partial structure, other physical
and/or chemical properties, functional characteristics when coupled with a known or 
disclosed correlation between function and structure, or some combination of such 
characteristics. What is conventional or well known to one of ordinary skill in the art 
need not be disclosed in detail. If a skilled artisan would have understood the inventor 
to be in possession of the claimed invention at the time of filing, even if every nuance of 
the claims is not explicitly described in the specification, then the adequate description 
requirement is met. [footnotes omitted] 

Thus, the written description standard is fulfilled by both what is specifically disclosed and what 
is conventional or well known to one skilled in the art. 

A. The specification provides an adequate written description of the claimed antibodies which
specifically bind to the recited "variants" and "fragments" of SEQ ID NO: 1.

The subject matter encompassed by claims 46, 48, 49, 51, 53-60, and 66-68 is either
disclosed by the specification or is conventional or well known to one skilled in the art. 

First note that the "variant" language of independent claim 46 recites polypeptides comprising
"a polypeptide having a naturally occurring amino acid sequence at least 90% identical to the amino
acid sequence of SEQ ID NO: 1, wherein the polypeptide has nucleotide pyrophosphohydrolase
activity." Furthermore, the "fragment" language of independent claim 46 recites polypeptides
comprising "a fragment of a polypeptide having the amino acid sequence of SEQ ID NO: 1, wherein the
fragment has nucleotide pyrophosphohydrolase activity," and polypeptides comprising "an immunogenic
fragment of a polypeptide having the amino acid sequence of SEQ ID NO: 1." The polypeptide
sequence of SEQ ID NO: 1 is explicitly disclosed in the specification. See, for example, the Sequence
Listing and Figures 1A, 1B, 1C, 1D, 1E, 1F, 1G, 1H, 1I, 1J, 1K, 2A, 2B, and 2C. Variants of SEQ
ID NO: 1 are described in the specification at, for example, page 3, lines 20-22; page 8, lines 16-25;
page 8, line 30 to page 9, line 4; page 11, lines 13-15; page 13, lines 1-3; page 15, lines 4-5 and 16-24;
page 17, lines 1-5; and page 21, lines 20-22; and fragments of SEQ ID NO: 1 are described at, for
example, page 3, lines 20-22 and 27-29; page 4, lines 24-29; page 8, lines 26-30; page 9, lines 15-28;
page 10, lines 7-11; page 18, lines 6-11; page 21, lines 11-15; page 30, lines 23-25; page 31, lines 7-13;
page 45, lines 28-30; page 55, lines 13-14; and page 55, line 28 to page 56, line 12. In addition, a
specific assay to measure nucleotide pyrophosphohydrolase activity is disclosed in the specification at,
for example, page 55, lines 22-26.

One of ordinary skill in the art would recognize polypeptide sequences which are variants that 
are at least 90% identical to SEQ ID NO: 1. Given any naturally occurring polypeptide sequence, it 
would be routine for one of skill in the art to recognize whether it was a variant of SEQ ID NO: 1. It
would also be routine to determine whether such a variant had nucleotide pyrophosphohydrolase 
activity, using the disclosed nucleotide pyrophosphohydrolase assay. Accordingly, the specification 
provides an adequate written description of the claimed antibodies which specifically bind to the recited
polypeptide variants of SEQ ID NO: 1.
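
By way of illustration only, the comparison described above can be sketched as a short computation. The sketch is not taken from the specification. It assumes that the two sequences have already been aligned to equal length (with "-" marking gaps) and that identity is counted over all alignment columns; the function names and the ten-residue toy sequences are hypothetical.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent of alignment columns in which both sequences carry the same residue.

    Assumption: the inputs are pre-aligned, equal-length strings, with '-'
    marking gaps. A gap never counts as a match.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("sequences must be aligned to the same length")
    if not aligned_a:
        raise ValueError("sequences must not be empty")
    matches = sum(1 for x, y in zip(aligned_a, aligned_b) if x == y and x != "-")
    return 100.0 * matches / len(aligned_a)


def meets_identity_threshold(aligned_a: str, aligned_b: str,
                             threshold: float = 90.0) -> bool:
    """True when the percent identity is at least `threshold`."""
    return percent_identity(aligned_a, aligned_b) >= threshold


# Hypothetical toy sequences (not SEQ ID NO: 1): one mismatch in ten columns.
reference = "MKTLLVAGAL"
candidate = "MKTLLVAGAV"
print(percent_identity(reference, candidate))          # 90.0
print(meets_identity_threshold(reference, candidate))  # True
```

In practice such a comparison would be carried out with an alignment program rather than by hand; the sketch shows only that the criterion reduces to a count of matching positions.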

One of ordinary skill in the art would recognize polypeptide sequences which are fragments of 
SEQ ID NO: 1. The amino acid sequence of SEQ ID NO: 1 provides the necessary framework for the 
recited fragments - to recite every possible fragment would needlessly clutter the application. It would 
be routine for one of skill in the art to determine whether any particular fragment of SEQ ID NO: 1 had 
nucleotide pyrophosphohydrolase activity, using the disclosed nucleotide pyrophosphohydrolase assay. 
Likewise, it would be routine for one of skill in the art to determine whether any particular fragment of 
SEQ ID NO: 1 had immunogenic activity, based on the methods recited in the specification at, for 
example, page 9, lines 13-29; page 30, line 15 to page 32, line 15; and page 55, line 28 to page 56, 
line 12. Accordingly, the specification provides an adequate written description of the claimed 
antibodies which specifically bind to the recited polypeptide fragments of SEQ ID NO: 1. 

The Examiner asserts that "the structural basis of the recited functional limitations common to
the claimed genus of fragments or variants is not disclosed, and thus one of skill could not readily 
distinguish between the genus of antibodies that specifically bind to fragments and variants of SEQ ID 
NO: 1 that have nucleotide phosphorylase activity from the genus of antibodies that specifically bind to
fragments of SEQ ID NO: 1 that do not that have nucleotide phosphorylase activity" (Office Action,
April 7, 2003; page 3). However, there is no requirement to provide a "structural basis" for the recited 
functional limitations in order for a skilled artisan to be able to distinguish antibodies that specifically 
bind to SEQ ID NO: 1 variants and fragments having nucleotide pyrophosphohydrolase activity from
antibodies that specifically bind to SEQ ID NO: 1 variants and fragments lacking nucleotide 
pyrophosphohydrolase activity. A skilled artisan could distinguish one genus of antibodies from the 
other by determining whether any particular SEQ ID NO: 1 variant or fragment, which is specifically 
bound by an antibody, has nucleotide pyrophosphohydrolase activity. For example, one of skill in the
art could routinely make such a determination using the assay for nucleotide pyrophosphohydrolase
activity disclosed in the specification at page 55, lines 22-26. 

Furthermore, the Examiner asserts that one of skill in the art could not "readily distinguish 
between the genus of fragments of SEQ ID NO: 1 that have biological activity from the genus of 
fragments of SEQ ID NO: 1 that do not have biological activity" (Office Action, April 7, 2003; page
3; emphasis added). This assertion is irrelevant to the issue at hand because the claims recite fragments
of SEQ ID NO: 1 that have nucleotide pyrophosphohydrolase activity or immunogenic activity. As
discussed above, one of skill in the art could routinely determine whether any particular fragment of
SEQ ID NO: 1 had nucleotide pyrophosphohydrolase activity by, for example, using the assay for
nucleotide pyrophosphohydrolase activity disclosed in the specification at page 55, lines 22-26.
Likewise, it would be routine for one of skill in the art to determine whether any particular fragment of
SEQ ID NO: 1 had immunogenic activity, based on the methods recited in the specification at, for
example, page 9, lines 13-29; page 30, line 15 to page 32, line 15; and page 55, line 28 to page 56,
line 12. 

1. The present claims specifically define the claimed genus through the recitation of 
chemical structure 

Court cases in which "DNA claims" have been at issue (which are hence relevant to claims to
proteins encoded by the DNA) commonly emphasize that the recitation of structural features or
chemical or physical properties is an important factor to consider in a written description analysis of
such claims. For example, in Fiers v. Revel, 25 USPQ2d 1601, 1606 (Fed. Cir. 1993), the court
stated that: 

If a conception of a DNA requires a precise definition, such as by structure, formula, 
chemical name or physical properties, as we have held, then a description also requires 
that degree of specificity. 

In a number of instances in which claims to DNA have been found invalid, the courts have 
noted that the claims attempted to define the claimed DNA in terms of functional characteristics without
any reference to structural features. As set forth by the court in University of California v. Eli Lilly 
and Co., 43 USPQ2d 1398, 1406 (Fed. Cir. 1997): 

In claims to genetic material, however, a generic statement such as "vertebrate insulin 
cDNA" or "mammalian insulin cDNA," without more, is not an adequate written 
description of the genus because it does not distinguish the claimed genus from others,
except by function. 

Thus, the mere recitation of functional characteristics of a DNA, without the definition of
structural features, has been a common basis by which courts have found invalid claims to DNA. For 
example, in Lilly, 43 USPQ2d at 1407, the court found invalid for violation of the written description 
requirement the following claim of U.S. Patent No. 4,652,525: 

1. A recombinant plasmid replicable in procaryotic host containing within its nucleotide 
sequence a subsequence having the structure of the reverse transcript of an mRNA of a 
vertebrate, which mRNA encodes insulin. 

In Fiers, 25 USPQ2d at 1603, the parties were in an interference involving the following count: 

A DNA which consists essentially of a DNA which codes for a human fibroblast 
interferon-beta polypeptide. 

Party Revel in the Fiers case argued that its foreign priority application contained an adequate 
written description of the DNA of the count because that application mentioned a potential method for 
isolating the DNA. The Revel priority application, however, did not have a description of any particular 
DNA structure corresponding to the DNA of the count. The court therefore found that the Revel 
priority application lacked an adequate written description of the subject matter of the count.

Thus, in Lilly and Fiers, nucleic acids were defined on the basis of functional characteristics
and were found not to comply with the written description requirement of 35 U.S.C. § 112; i.e., "an
mRNA of a vertebrate, which mRNA encodes insulin" in Lilly, and "DNA which codes for a human
fibroblast interferon-beta polypeptide" in Fiers. In contrast to the situation in Lilly and Fiers, the
claims at issue in the present application define the polypeptides bound by the claimed antibodies in
terms of chemical structure, rather than functional characteristics. For example, the language of
independent claim 46 recites chemical structure to define the claimed genus:

46. An isolated antibody which specifically binds to a polypeptide comprising a
polypeptide selected from the group consisting of:

a) a polypeptide having the amino acid sequence of SEQ ID NO: 1,

b) a polypeptide having a naturally occurring amino acid sequence at least 90%
identical to the amino acid sequence of SEQ ID NO: 1, wherein the
polypeptide has nucleotide pyrophosphohydrolase activity,

c) a fragment of a polypeptide having the amino acid sequence of SEQ ID NO: 1,
wherein the fragment has pyrophosphohydrolase activity, and

d) an immunogenic fragment of a polypeptide having the amino acid sequence of
SEQ ID NO: 1.

From the above it should be apparent that the claims of the subject application are 
fundamentally different from those found invalid in Lilly and Fiers. The subject matter of the present
claims is defined in terms of the chemical structure of SEQ ID NO: 1. In the present case, there is no
reliance merely on a description of functional characteristics of the polypeptides specifically bound by
the claimed antibodies. The polypeptides defined by the claims of the present application recite
structural features, and cases such as Lilly and Fiers stress that the recitation of structure is an
important factor to consider in a written description analysis of claims of this type. By failing to base the
written description inquiry "on whatever is now claimed," the Examiner failed to provide an appropriate
analysis of the present claims and how they differ from those found not to satisfy the written description
requirement in Lilly and Fiers.

The Patent Office Guidelines indicate that evidence that Appellants were in possession of the 
claimed invention can include "complete or partial structure, other physical and/or chemical properties,
functional characteristics when coupled with a known or disclosed correlation between function and
structure, or some combination of such characteristics" (P.T.O. Guidelines, supra; emphasis added).
The claimed antibodies which specifically bind the recited variants and fragments of the SEQ ID NO: 1
polypeptide have been described by chemical structure (e.g., relation of the recited polypeptide variants
and fragments to SEQ ID NO: 1), physical properties (e.g., occurrence in nature of the recited
polypeptide variants), and chemical properties (e.g., possession of nucleotide pyrophosphohydrolase 
activity by the recited polypeptide variants and fragments; specific binding of the claimed antibodies to 
the recited polypeptide variants and fragments). Therefore, the written description requirement has 
been met. 

2. The present claims do not define a genus which is "highly variant"

Furthermore, the claims at issue do not describe a genus which could be characterized as
"highly variant." Available evidence illustrates that, rather than being a large variable genus, the claimed
genus is of narrow scope.

In support of this assertion, the Examiner's attention is directed to the enclosed reference by
Brenner et al. ("Assessing sequence comparison methods with reliable structurally identified distant 
evolutionary relationships," Proc. Natl. Acad. Sci. USA, 1998, 95:6073-6078). Through exhaustive 
analysis of a data set of proteins with known structural and functional relationships and with <90% 
overall sequence identity, Brenner et al. have determined that 30% identity is a reliable threshold for 
establishing evolutionary homology between two sequences aligned over at least 150 residues (Brenner 
et al., pages 6073 and 6076). Furthermore, local identity is particularly important in this case for 
assessing the significance of the alignments, as Brenner et al. further report that ≥40% identity over at
least 70 residues is reliable in signifying homology between proteins (Brenner et al., page 6076). 

The present application is directed, inter alia, to antibodies which specifically bind to 
polypeptides which are nucleotide pyrophosphohydrolases, including polypeptides which are nucleotide 
pyrophosphohydrolases related to the amino acid sequence of SEQ ID NO: 1. In accordance with 
Brenner et al., naturally occurring molecules may exist which could be characterized as nucleotide
pyrophosphohydrolases and which have as little as 30% identity over at least 150 residues to SEQ ID
NO: 1. The "variant language" of the present claims recites a polypeptide comprising "a naturally
occurring amino acid sequence at least 90% identical to the amino acid sequence of SEQ ID NO: 1"
(note that SEQ ID NO: 1 has 1156 amino acid residues). This variation is far less than that of all
potential nucleotide pyrophosphohydrolases related to SEQ ID NO: 1, i.e., those nucleotide
pyrophosphohydrolases having as little as 30% identity over at least 150 residues to SEQ ID NO: 1.
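
The difference in scope can be made concrete with simple arithmetic. The following sketch is illustrative only and is not part of the record. It assumes, for the 90% criterion, that identity is measured over the full 1156-residue length of SEQ ID NO: 1, and it applies the Brenner et al. thresholds at their minimum alignment lengths; the helper name is hypothetical.

```python
def min_identical_residues(length: int, percent: int) -> int:
    """Smallest whole number of identical residues that reaches `percent`
    identity over an alignment of `length` residues (integer ceiling)."""
    return -(-length * percent // 100)


SEQ_ID_NO_1_LENGTH = 1156  # length stated in the specification

# Claimed variants: at least 90% identity (assumed here to be over the full length).
claimed_min = min_identical_residues(SEQ_ID_NO_1_LENGTH, 90)
claimed_max_differences = SEQ_ID_NO_1_LENGTH - claimed_min

# Brenner et al. homology thresholds at their minimum alignment lengths.
brenner_150 = min_identical_residues(150, 30)
brenner_70 = min_identical_residues(70, 40)

print(claimed_min)              # 1041
print(claimed_max_differences)  # 115
print(brenner_150)              # 45
print(brenner_70)               # 28
```

Under these assumptions, a claimed variant must match SEQ ID NO: 1 at no fewer than 1041 of its 1156 positions, whereas the Brenner et al. thresholds are satisfied by as few as 45 matching residues over a 150-residue alignment.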

The Examiner asserts that "Applicant further contends that Brenner et al teach that 30% identity 
is a reliable threshold for establishing evolutionary homology between two sequences. However it is 
noted that the rejection is not based on the evolutionary homology between sequences but whether one 
of skill can envision the claimed genus of antibodies which bind polypeptides which have the disclosed
asserted function of having nucleotide phosphorylase activity, from those that don't" (Office Action, 
April 7, 2003; page 3). However, the Examiner's arguments do not address the degree of variation 
within the recited genus of polypeptides which are specifically bound by the claimed antibodies. The 
Brenner et al. reference has been provided as evidence that the recited genus of polypeptides is not 
highly variant because the criteria used to define the structures of the members of the claimed genus 
(e.g., at least 90% identical to a reference sequence such as SEQ ID NO: 1) are conservative relative to 
the broadest criteria which a skilled artisan would consider to be reasonable (e.g., the criteria of 
Brenner et al. that 30% identity over at least 150 residues, or 40% identity over at least 70 residues, 
reasonably denotes homology). Since the recited genus of polypeptides which are specifically bound 
by the claimed antibodies is not highly variant, one of skill in the art would reasonably understand that 
Appellants were in possession of the claimed invention at the time the application was filed. 

3. The state of the art at the time of the present invention is further advanced than at 
the time of the Lilly and Fiers applications 

In the Lilly case, claims of U.S. Patent No. 4,652,525 were found invalid for failing to comply 
with the written description requirement of 35 U.S.C. § 112. The '525 patent claimed the benefit of
priority of two applications, Application Serial No. 801,343 filed May 27, 1977, and Application Serial 
No. 805,023 filed June 9, 1977. In the Fiers case, party Revel claimed the benefit of priority of an 
Israeli application filed on November 21, 1979. Thus, the written description inquiry in those cases 
was based on the state of the art at essentially the "dark ages" of recombinant DNA technology.

The present application has a priority date of December 22, 1997. Much has happened in the
development of recombinant DNA technology in the 20 or so years between the filing of the
applications involved in Lilly and Fiers and that of the present application. For example, the technique of
polymerase chain reaction (PCR) was invented. Highly efficient cloning and DNA sequencing
technology has been developed. Large databases of protein and nucleotide sequences have been
compiled. Much of the raw material of the human and other genomes has been sequenced. With these
remarkable advances, one of skill in the art would recognize that, given the sequence information of 
SEQ ID NO: 1, and the additional extensive detail provided by the subject application, the present 
inventors were in possession of the claimed antibodies which specifically bind the recited polypeptide 
variants and fragments at the time of filing of this application. 

4. Summary 

The Examiner failed to base the written description inquiry "on whatever is now claimed." 
Consequently, the Examiner did not provide an appropriate analysis of the present claims and how they 
differ from those found not to satisfy the written description requirement in cases such as Lilly and 
Fiers. In particular, the claims of the subject application are fundamentally different from those found 
invalid in Lilly and Fiers. The subject matter of the present claims is defined in terms of the chemical 
structure of SEQ ID NO: 1. The courts have stressed that structural features are important factors to
consider in a written description analysis of claims to nucleic acids and proteins. In addition, the genus 
of polypeptides recited by the present claims is adequately described, as evidenced by Brenner et al. 
Furthermore, there have been remarkable advances in the state of the art since the Lilly and Fiers 
cases, and these advances were given no consideration whatsoever in the position set forth by the 
Examiner. 

For at least the reasons set forth above, the specification provides an adequate written 
description of the claimed antibodies which specifically bind to the recited polypeptide "variants" and
"fragments," and this rejection should be overturned.

Issue 2 - Whether claims 46, 48, 49, 51, 53-60, and 66-68 meet the enablement requirement of
35 U.S.C. § 112, first paragraph

Claims 46, 48, 49, 51, 53-60, and 66-68 stand rejected under 35 U.S.C. § 112, first 
paragraph, based on the allegation that the specification does not describe the subject matter of the 
invention in such a way as to enable one of skill in the art to make and/or use antibodies which 
specifically bind to the recited "variants" and "fragments" of SEQ ID NO: 1. In particular, the Examiner
asserts that "there is insufficient direction regarding how to make and use an antibody that specifically
binds to any fragment or variant of SEQ ID NO: 1, said variants and fragments encompassing a wide
range of polypeptides" (Office Action, April 7, 2003; page 4). Such, however, is not the case. 

The specification discloses methods to make antibodies which specifically bind to a polypeptide
having any particular amino acid sequence (e.g., at page 30, line 15 to page 32, line 15; and page 55,
line 28 to page 56, line 12). Given the information provided by SEQ ID NO: 1 (the amino acid
sequence of NTPPH-2), one of skill in the art would be able to routinely obtain antibodies which
specifically bind to any of the recited variants and fragments of SEQ ID NO: 1, including a polypeptide
comprising "a polypeptide having a naturally occurring amino acid sequence at least 90% identical to
the amino acid sequence of SEQ ID NO: 1, wherein the polypeptide has nucleotide
pyrophosphohydrolase activity," a polypeptide comprising "a fragment of a polypeptide having the
amino acid sequence of SEQ ID NO: 1, wherein the fragment has pyrophosphohydrolase activity," and
a polypeptide comprising "an immunogenic fragment of a polypeptide having the amino acid sequence
of SEQ ID NO: 1." For example, an animal could be immunized with any of the recited variants and
fragments of SEQ ID NO: 1, antibodies could be isolated from the animal, and the antibodies could be 
screened to identify antibodies which specifically bind to the polypeptide. 

Likewise, the specification discloses methods to use antibodies which specifically bind to a 
polypeptide having any particular amino acid sequence in, for example, the purification of such 
polypeptides (e.g., at page 56, lines 14-24), the detection and/or measurement of such polypeptides 
(e.g., at page 26, line 25 to page 27, line 3; and page 38, line 14 to page 39, line 5), and the 
competitive screening of drug candidates (e.g., at page 46, lines 27-30). Given the information 
provided by SEQ ID NO: 1 (the amino acid sequence of NTPPH-2), one of skill in the art would be
able to routinely use antibodies which specifically bind to any of the recited variants and fragments of
SEQ ID NO: 1, including a polypeptide comprising "a polypeptide having a naturally occurring amino
acid sequence at least 90% identical to the amino acid sequence of SEQ ID NO: 1, wherein the
polypeptide has nucleotide pyrophosphohydrolase activity," a polypeptide comprising "a fragment of a
polypeptide having the amino acid sequence of SEQ ID NO: 1, wherein the fragment has nucleotide
pyrophosphohydrolase activity," and a polypeptide comprising "an immunogenic fragment of a
polypeptide having the amino acid sequence of SEQ ID NO: 1." For example, an antibody which
specifically binds to any of the recited variants and fragments of SEQ ID NO: 1 could be coupled to an 
activated chromatographic resin, and this resin could then be used in an immunoaffinity column to purify 
the polypeptide. 

In support of this rejection, the Examiner has stated that "the specification does not appear to 
disclose the sequence of any said polypeptides comprising an amino acid sequence at least 90%
identical to an amino acid sequence of SEQ ID NO: 1" (Office Action, October 21, 2002; page 6).
Furthermore, the Examiner has asserted that "[t]he specification does not appear to disclose or
exemplify any said biologically active or immunogenic fragments of a polypeptide having an amino acid
sequence of SEQ ID NO: 1" (Id.). The Examiner is incorrect in asserting that the recited polypeptide
variants and fragments which are specifically bound by the claimed antibodies are not disclosed by the 
specification. Variants of SEQ ID NO: 1 are disclosed in the specification at, for example, page 3, lines 
20-22; page 8, lines 16-25; page 8, line 30 to page 9, line 4; page 15, lines 16-24; page 17, lines 1-5; 
and page 21, lines 20-22. Fragments of SEQ ID NO: 1 are disclosed in the specification at, for 
example, page 3, lines 20-22 and 27-29; page 8, lines 26-30; page 9, lines 15-28; page 10, lines 7-11; 
page 30, lines 23-25; page 31, lines 7-13; and page 55, line 28 to page 56, line 12. In addition, an 
assay to measure nucleotide pyrophosphohydrolase activity is disclosed in the specification at, for 
example, page 55, lines 22-26. Therefore, the recited polypeptide variants and fragments are fully
disclosed in the specification. Furthermore, antibodies which specifically bind to NTPPH-2, and 
variants and fragments thereof, are disclosed in the specification at, for example, page 4, lines 27-29; 
page 9, lines 13-21; and page 31, lines 7-13.

Furthermore, the Examiner has argued that "[w]ithout knowing the function of the polypeptides
related to a polypeptide comprising an amino acid sequence comprising SEQ ID NO: 1, it would
require undue experimentation for one of skill to predict the function of antibodies which specifically
binds to said polypeptides" (Office Action, October 21, 2002; pages 6-7). This is incorrect. No
undue experimentation would be required because it is a trivial matter to "predict the function of
antibodies" which specifically bind to the recited polypeptides. The "function" of such antibodies is to
specifically bind to the recited polypeptides, and a skilled artisan would recognize this immediately.

Moreover, it would not require undue experimentation to make and use the claimed antibodies. 
Antibodies which specifically bind to a polypeptide can be made as long as that polypeptide, or 
fragments thereof, are available; there is no restriction on the amino acid sequence of polypeptides that 
can be used to make antibodies. Since a polypeptide having any amino acid sequence (including any
amino acid sequence that is 90% identical to SEQ ID NO: 1, any naturally occurring amino acid
sequence that is 90% identical to SEQ ID NO: 1, and any fragment of SEQ ID NO: 1) can be used to
make antibodies using the methods disclosed in the specification, it is not necessary to identify particular
naturally occurring amino acid sequences that are 90% identical to SEQ ID NO: 1, or particular
fragments of SEQ ID NO: 1, that could be used in this manner.

The Examiner states that "the rejection is based on the scope of the claimed variants and
fragments of the polypeptides to which said antibodies specifically bind" (Office Action, April 7, 2003;
page 4). In particular, the Examiner asserts that "the problem of predicting what changes can be
tolerated while still maintaining the functional nucleotide phosphorylase activity of the recited variants
and fragments of SEQ ID NO: 1, based on the sequence data of a single amino acid sequence (SEQ ID
NO: 1), is complex and well outside the realm of routine experimentation" and that "even a single amino
acid change in a polypeptide's amino acid sequence can have dramatic effects on its function" (Id.). In
support of these assertions, the Examiner has cited Sugie et al. (Proc. Natl. Acad. Sci. USA, 1997,
94:5278-5283). This reference teaches that human glycosylation-inhibiting factor differs from human
macrophage migration inhibitory factor by one amino acid residue, and yet these proteins do not share
all of their biological functions (Office Action, October 21, 2002; page 6).

However, the Examiner's assertions are irrelevant because it is not necessary to predict what
changes in SEQ ID NO: 1 "can be tolerated while still maintaining the functional nucleotide
phosphorylase activity" in order to make and/or use the claimed antibodies. For example, the claimed
antibodies include antibodies which specifically bind to a polypeptide comprising "a polypeptide having
a naturally occurring amino acid sequence at least 90% identical to the amino acid sequence of SEQ ID
NO: 1, wherein the polypeptide has nucleotide pyrophosphohydrolase activity," and to a polypeptide
comprising "a fragment of a polypeptide having the amino acid sequence of SEQ ID NO: 1, wherein the
fragment has nucleotide pyrophosphohydrolase activity." An assay to measure nucleotide
pyrophosphohydrolase activity is disclosed in the specification at, for example, page 55, lines 22-26.
One of ordinary skill in the art could routinely use the disclosed assay to identify polypeptide variants
and fragments recited by the claims. One could then routinely make and/or use antibodies which
specifically bind to these polypeptide variants and fragments. Contrary to the Examiner's assertions, no
undue experimentation would be required.

As set forth in In re Marzocchi, 169 USPQ 367, 369 (CCPA 1971):

The first paragraph of § 112 requires nothing more than objective enablement. How
such a teaching is set forth, either by the use of illustrative examples or by broad
terminology, is of no importance.

As a matter of Patent Office practice, then, a specification disclosure which contains a
teaching of the manner and process of making and using the invention in terms which
correspond in scope to those used in describing and defining the subject matter sought
to be patented must be taken as in compliance with the enabling requirement of the first
paragraph of § 112 unless there is reason to doubt the objective truth of the statements
contained therein which must be relied on for enabling support.

Contrary to the standard set forth in Marzocchi, the Examiner has failed to provide any
reasons why one would doubt that the guidance provided by the present specification would enable
one to make and use the claimed antibodies which specifically bind to the recited variants and fragments
of SEQ ID NO: 1. Hence, a prima facie case for non-enablement has not been established with
respect to the claimed antibodies which specifically bind to the recited variants and fragments of SEQ
ID NO: 1.

For at least the above reasons, reversal of this rejection is requested.

Issue 3 - Whether claims 46, 48, 49, 51, 53-60, and 66 meet the requirements of 35 U.S.C. §
112, second paragraph

Claims 46, 48-49, 51, 53-60, and 66 were rejected under 35 U.S.C. § 112, second
paragraph, based on the allegation that the recitation of "at least 90% identical" is indefinite. The
Examiner asserts that "the algorithm used to define identity is not disclosed in the specification," and that
"[i]t is not clear how an amino acid sequence can have homology to another amino acid sequence"
(Office Action, April 7, 2003; page 4). This rejection is traversed.

Under the second paragraph of 35 U.S.C. § 112, the standard for "definiteness" is that the
claims define patentable subject matter with a reasonable degree of precision and particularity. See In
re Miller, 169 USPQ 597, 599 (CCPA 1971); In re Moore, 169 USPQ 236, 238 (CCPA 1971).
See also M.P.E.P. § 706.03(d). In this regard, the Supreme Court has indicated that the primary
purpose of claim language is to give "fair" notice of what would constitute the infringement of a claim.
See United Carbon Co. v. Binney & Smith Co., 317 U.S. 228, 55 USPQ 381 (1942). In other
words, the basic purpose of 35 U.S.C. § 112, second paragraph, is to require a claim to reasonably
apprise those skilled in the art of the scope of the invention defined by that claim and give fair notice of
what constitutes infringement of the claim. See Antonius v. Pro Group Inc., 217 USPQ 875, 877
(6th Cir. 1983). The present claims meet the legal standards required by 35 U.S.C. § 112, second
paragraph.

One of ordinary skill in the art would understand the meaning of the term "at least 90%
identical" when this term is used for defining the structure of an amino acid sequence in relation to a
reference amino acid sequence, as in the claims at issue. The Examiner recognizes this in stating that the
term "identity" is "defined in the specification on page 11, by stating that the term identity may
substituted for the term homology and refers to a degree of complementarity" (Office Action, April
7, 2003; page 4; emphasis added). However, the Examiner errs in requiring an explicit disclosure of
the algorithm used to calculate percent identity. A skilled artisan would reasonably understand that
percent identity is simply the percentage of amino acid residues in a polypeptide sequence which are
identical to those in a reference sequence. Moreover, a skilled artisan would know that the percent
identity between two amino acid sequences can be calculated using basic mathematics. For example,
to arrive at the percent identity, a subject sequence and a reference sequence are compared, the
number of amino acids which are identical in these sequences is summed up, and the result is divided by
the total number of amino acids in the reference sequence. Therefore, the claims are definite in their
recitation of amino acid sequences which are "at least 90% identical" to the amino acid sequence of
SEQ ID NO: 1.
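
The calculation described in the preceding paragraph can be sketched as follows. This is only an illustration under stated assumptions: the function name `percent_identity` is hypothetical, the two sequences are assumed to have been aligned already (equal length, with `-` marking a gap), and nothing here is taken from the specification.

```python
def percent_identity(subject: str, reference: str) -> float:
    """Percentage of reference residues matched by the subject.

    Assumption: both strings are already aligned position by position
    and '-' marks a gap. How the alignment is obtained is not addressed.
    """
    if len(subject) != len(reference):
        raise ValueError("aligned sequences must have equal length")

    # Sum the positions where both sequences carry the same residue.
    identical = sum(
        1 for s, r in zip(subject, reference) if s == r and r != "-"
    )

    # Divide by the number of residues in the reference (gaps excluded).
    reference_length = sum(1 for r in reference if r != "-")
    if reference_length == 0:
        raise ValueError("reference sequence is empty")

    return 100.0 * identical / reference_length
```

With this sketch, a ten-residue subject that differs from its reference at one position scores 90, which is the boundary recited in the claims.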

The Examiner insists that grounds for indefiniteness include "that there are several programs that
use different algorithms to determine homology, and that the specification discloses no specific single
algorithm" (Office Action, April 7, 2003; page 5). However, the Examiner provides no evidence to
support these assertions. Calculations of percent identity between sequences, regardless of the
algorithm used, are essentially the division of the number of identical amino acid residues by the total
number of amino acid residues. Even if it were true that there are several programs using different
algorithms to carry out such calculations, one of skill in the art would nevertheless understand the basic
calculation underlying all such algorithms. Thus, there is no need for a disclosure of only a single
algorithm to determine percent identity in order to satisfy the requirements of the second paragraph of
35 U.S.C. § 112. All that is necessary to satisfy the second paragraph of 35 U.S.C. § 112 is that one
of skill in the art be able to reasonably determine what is within the scope of the claim. In the present
case, one of skill in the art would reasonably understand whether any particular polypeptide sequence
was at least 90% identical to SEQ ID NO: 1, without needing a disclosure of only a single program or
algorithm.

For at least the above reasons, reversal of this rejection under 35 U.S.C. § 112, second
paragraph, is requested.

(9) CONCLUSION 

The written description rejections, enablement rejections, and indefiniteness rejections should
be reversed, based on at least the arguments presented above.

Due to the urgency of this matter, and its economic and public health implications, an expedited
review of this appeal is earnestly solicited.

If the USPTO determines that any additional fees are due, the Commissioner is hereby 
authorized to charge Deposit Account No. 09-0108. 

This brief is enclosed in triplicate.

APPENDIX

Claims on appeal: 

46. An isolated antibody which specifically binds to a polypeptide comprising a polypeptide
selected from the group consisting of: 

a) a polypeptide having the amino acid sequence of SEQ ID NO: 1, 

b) a polypeptide having a naturally occurring amino acid sequence at least 90% identical to the
amino acid sequence of SEQ ID NO: 1, wherein the polypeptide has nucleotide pyrophosphohydrolase
activity, 

c) a fragment of a polypeptide having the amino acid sequence of SEQ ID NO: 1, wherein the 
fragment has nucleotide pyrophosphohydrolase activity, and 

d) an immunogenic fragment of a polypeptide having the amino acid sequence of SEQ ID
NO: 1.

48. The antibody of claim 46, wherein the antibody is: 

a) a chimeric antibody, 

b) a single chain antibody,

c) a Fab fragment, 

d) a F(ab')2 fragment, or 

e) a humanized antibody. 

49. A composition comprising an antibody of claim 46 and an acceptable excipient.
51. A composition of claim 49, further comprising a label. 

53. A method of preparing a polyclonal antibody with the specificity of the antibody of claim
46, the method comprising:

a) immunizing an animal with a polypeptide having an amino acid sequence of SEQ ID NO: 1,
or an immunogenic fragment thereof, under conditions to elicit an antibody response,

b) isolating antibodies from said animal, and 

c) screening the isolated antibodies with the polypeptide, thereby identifying a polyclonal 
antibody which binds specifically to a polypeptide having an amino acid sequence of SEQ ID NO: 1. 

54. A polyclonal antibody produced by a method of claim 53. 

55. A composition comprising the antibody of claim 54 and a suitable carrier.

56. A method of making a monoclonal antibody with the specificity of the antibody of claim 46,
the method comprising:

a) immunizing an animal with a polypeptide having an amino acid sequence of SEQ ID NO: 1, 
or an immunogenic fragment thereof, under conditions to elicit an antibody response, 

b) isolating antibody producing cells from the animal,

c) fusing the antibody producing cells with immortalized cells to form monoclonal antibody-
producing hybridoma cells,

d) culturing the hybridoma cells, and 

e) isolating from the culture monoclonal antibody which binds specifically to a polypeptide 
having an amino acid sequence of SEQ ID NO: 1.

57. A monoclonal antibody produced by a method of claim 56.

58. A composition comprising the antibody of claim 57 and a suitable carrier.

59. The antibody of claim 46, wherein the antibody is produced by screening a Fab expression
library.

60. The antibody of claim 46, wherein the antibody is produced by screening a recombinant
immunoglobulin library.

66. An isolated antibody of claim 46, which specifically binds to a polypeptide comprising a 
naturally occurring amino acid sequence at least 90% identical to the amino acid sequence of SEQ ID
NO: 1, wherein the polypeptide has nucleotide pyrophosphohydrolase activity. 

67. An isolated antibody of claim 46, which specifically binds to a fragment of a polypeptide,
wherein the polypeptide consists of the amino acid sequence of SEQ ID NO: 1, and wherein the
fragment has nucleotide pyrophosphohydrolase activity. 

68. An isolated antibody of claim 46, which specifically binds to an immunogenic fragment of a 
polypeptide, wherein the polypeptide consists of the amino acid sequence of SEQ ID NO: 1.

ABSTRACT Pairwise sequence comparison methods have
been assessed using proteins whose relationships are known
reliably from their structures and functions, as described in
the SCOP database [Murzin, A., Brenner, S., Hubbard, T.
& Chothia, C. (1995) J. Mol. Biol. 247, 536-540]. The evaluation
tested the programs BLAST [Altschul, S., Gish, W.,
Miller, W., Myers, E. W. & Lipman, D. J. (1990) J. Mol. Biol.
215, 403-410], WU-BLAST2 [Altschul, S. F. & Gish, W. (1996)
Methods Enzymol. 266, 460-480], FASTA [Pearson, W. R. &
Lipman, D. J. (1988) Proc. Natl. Acad. Sci. USA 85, 2444-2448],
and SSEARCH [Smith, T. F. & Waterman, M. S. (1981) J. Mol.
Biol. 147, 195-197] and their scoring schemes. The error rate
of all algorithms is greatly reduced by using statistical scores
to evaluate matches rather than percentage identity or raw
scores. The E-value statistical scores of SSEARCH and FASTA are
reliable: the number of false positives found in our tests agrees
well with the scores reported. However, the P-values reported
by BLAST and WU-BLAST2 exaggerate significance by orders of
magnitude. SSEARCH, FASTA ktup = 1, and WU-BLAST2 perform
best, and they are capable of detecting almost all relationships
between proteins whose sequence identities are >30%. For
more distantly related proteins, they do much less well; only
one-half of the relationships between proteins with 20-30%
identity are found. Because many homologs have low sequence
similarity, most distant relationships cannot be detected by
any pairwise comparison method; however, those which are
identified may be used with confidence.



Sequence database searching plays a role in virtually every
branch of molecular biology and is crucial for interpreting the
sequences issuing forth from genome projects. Given the
method's central role, it is surprising that overall and relative
capabilities of different procedures are largely unknown. It is
difficult to verify algorithms on sample data because this
requires large data sets of proteins whose evolutionary
relationships are known unambiguously and independently of the
methods being evaluated. However, nearly all known homologs
have been identified by sequence analysis (the method
to be tested). Also, it is generally very difficult to know, in the
absence of structural data, whether two proteins that lack clear
sequence similarity are unrelated. This has meant that although
previous evaluations have helped improve sequence
comparison, they have suffered from insufficient, imperfectly
characterized, or artificial test data. Assessment also has been
problematic because high quality database sequence searching
attempts to have both sensitivity (detection of homologs) and
specificity (rejection of unrelated proteins); however, these
complementary goals are linked such that increasing one
causes the other to be reduced.

Sequence comparison methodologies have evolved rapidly,
so no previously published tests have evaluated modern versions
of programs commonly used. For example, parameters in
BLAST (1) have changed, and WU-BLAST2 (2), which produces
gapped alignments, has become available. The latest version
of FASTA (3) previously tested was 1.6, but the current release
(version 3.0) provides fundamentally different results in the
form of statistical scoring.

The previous reports also have left gaps in our knowledge.
For example, there has been no published assessment of
thresholds for scoring schemes more sophisticated than
percentage identity. Thus, the widely discussed statistical scoring
measures have never actually been evaluated on large
databases of real proteins. Moreover, the different scoring schemes
commonly in use have not been compared.

Beyond these issues, there is a more fundamental question:
in an absolute sense, how well does pairwise sequence
comparison work? That is, what fraction of homologous proteins
can be detected using modern database searching methods?

In this work, we attempt to answer these questions and to
overcome both of the fundamental difficulties that have
hindered assessment of sequence comparison methodologies.
First, we use the set of distant evolutionary relationships in the
SCOP: Structural Classification of Proteins database (4), which
is derived from structural and functional characteristics (5).
The SCOP database provides a uniquely reliable set of
homologs, which are known independently of sequence
comparison. Second, we use an assessment method that jointly
measures both sensitivity and specificity. This method allows
straightforward comparison of different sequence searching
procedures. Further, it can be used to aid interpretation of real
database searches and thus provide optimal and reliable
results.

Previous Assessments of Sequence Comparison. Several
previous studies have examined the relative performance of
different sequence comparison methods. The most
encompassing analyses have been by Pearson (6, 7), who compared
the three most commonly used programs. Of these, the
Smith-Waterman algorithm (8) implemented in SSEARCH (3) is the
oldest and slowest but the most rigorous. Modern heuristics
have provided BLAST (1) the speed and convenience to make
it the most popular program. Intermediate between these two
is FASTA (3), which may be run in two modes offering either
greater speed (ktup = 2) or greater effectiveness (ktup = 1).
Pearson also considered different parameters for each of these
programs.

To test the methods, Pearson selected two representative
proteins from each of 67 protein superfamilies defined by the
PIR database (9). Each was used as a query to search the
database, and the matched proteins were marked as being
homologous or unrelated according to their membership of PIR
superfamilies. Pearson found that modern matrices and
"ln-scaling" of raw scores improve results considerably. He also
reported that the rigorous Smith-Waterman algorithm worked
slightly better than FASTA, which was in turn more effective
than BLAST.

Very large scale analyses of matrices have been performed
(10), and Henikoff and Henikoff (11) also evaluated the
effectiveness of BLAST and FASTA. Their test with BLAST
considered the ability to detect homologs above a
predetermined score but had no penalty for methods which also
reported large numbers of spurious matches. The Henikoffs
searched the SWISS-PROT database (12) and used PROSITE (13)
to define homologous families. Their results showed that the
BLOSUM62 matrix (14) performed markedly better than the
extrapolated PAM-series matrices (15), which previously had
been popular.

A crucial aspect of any assessment is the data that are used
to test the ability of the program to find homologs. But in
Pearson's and the Henikoffs' evaluations of sequence
comparison, the correct results were effectively unknown. This is
because the superfamilies in PIR and PROSITE are principally
created by using the same sequence comparison methods
which are being evaluated. Interdependency of data and
methods creates a "chicken and egg" problem, and means, for
example, that new methods would be penalized for correctly
identifying homologs missed by older programs. For instance,
immunoglobulin variable and constant domains are clearly
homologous, but PIR places them in different superfamilies.
The problem is widespread: each superfamily in PIR 48.00 with
a structural homolog is itself homologous to an average of 1.6
other PIR superfamilies (16).

To surmount these sorts of difficulties, Sander and Schneider
(17) used protein structures to evaluate sequence
comparison. Rather than comparing different sequence comparison
algorithms, their work focused on determining a
length-dependent threshold of percentage identity, above which all
proteins would be of similar structure. A result of this analysis
was the HSSP equation; it states that proteins with 25% identity
over 80 residues will have similar structures, whereas shorter
alignments require higher identity. (Other studies also have
used structures (18-20), but these focused on a small number
of model proteins and were principally oriented toward
evaluating alignment accuracy rather than homology detection.)
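
A minimal sketch of the HSSP threshold and the HSSP-adjusted score, following the piecewise form quoted in the Fig. 1 legend of this paper. The exponent -0.562 is an assumption taken from Sander and Schneider's published curve (it makes the formula meet 24.7 at a length of 80). The function names, the use of infinity for lengths under 10 (the legend only says H > 100 there), and applying the power law at exactly 10 and 80 residues are all choices made for this illustration.

```python
import math


def hssp_threshold(length: int) -> float:
    """Length-dependent percentage-identity threshold H (assumed form)."""
    if length < 10:
        # Legend states only H > 100, so no alignment this short can pass;
        # infinity is a stand-in chosen for this sketch.
        return math.inf
    if length > 80:
        return 24.7
    # Power-law region; exponent -0.562 is an assumption (see lead-in).
    return 290.15 * length ** -0.562


def hssp_adjusted(percent_identity: float, length: int) -> float:
    """Percent identity within the alignment minus the threshold H."""
    return percent_identity - hssp_threshold(length)
```

A positive adjusted score means the alignment lies above the threshold curve; a short alignment therefore needs a much higher raw identity than a long one to reach the same adjusted score.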

A general solution to the problem of scoring comes from
statistical measures (i.e., E-values and P-values) based on the
extreme value distribution (21). Extreme value scoring was
implemented analytically in the BLAST program using the
Karlin and Altschul statistics (22, 23), and empirical
approaches have been recently added to FASTA and SSEARCH. In
addition to being heralded as a reliable means of recognizing
significantly similar proteins (24, 25), the mathematical
tractability of statistical scores "is a crucial feature of the BLAST
algorithm" (1). The validity of this scoring procedure has been
tested analytically and empirically (see ref. 2 and references in
ref. 24). However, all large empirical tests used random
sequences that may lack the subtle structure found within
biological sequences (26, 27) and obviously do not contain any
real homologs. Thus, although many researchers have
suggested that statistical scores be used to rank matches (24, 25,
28), there have been no large rigorous experiments on
biological data to determine the degree to which such rankings are
superior.

A Database for Testing Homology Detection. Since the
discovery that the structures of hemoglobin and myoglobin are
very similar though their sequences are not (29), it has been
apparent that comparing structures is a more powerful (if less
convenient) way to recognize distant evolutionary
relationships than comparing sequences. If two proteins show a high
degree of similarity in their structural details and function, it
is very probable that they have an evolutionary relationship
though their sequence similarity may be low.

The recent growth of protein structure information
combined with the comprehensive evolutionary classification in
the SCOP database (4, 5) have allowed us to overcome previous
limitations. With these data, we can evaluate the performance
of sequence comparison methods on real protein sequences
whose relationships are known confidently. The SCOP database
uses structural information to recognize distant homologs, the
large majority of which can be determined unambiguously.
These superfamilies, such as the globins or the
immunoglobulins, would be recognized as related by the vast majority of the
biological community despite the lack of high sequence
similarity.

From SCOP, we extracted the sequences of domains of
proteins in the Protein Data Bank (PDB) (30) and created two
databases. One (PDB90D-B) has domains which were all <90%
identical to any other, whereas the other (PDB40D-B) had those <40%
identical. The databases were created by first sorting all
protein domains in SCOP by their quality and making a list. The
highest quality domain was selected for inclusion in the
database and removed from the list. Also removed from the list
(and discarded) were all other domains above the threshold
level of identity to the selected domain. This process was
repeated until the list was empty. The PDB40D-B database
contains 1,323 domains, which have 9,044 ordered pairs of
distant relationships, or ~0.5% of the total 1,749,006 ordered
pairs. In PDB90D-B, the 2,079 domains have 53,988
relationships, representing 1.2% of all pairs. Low complexity regions
of sequence can achieve spurious high scores, so these were
masked in both databases by processing with the SEG program
(27) using recommended parameters: 12 1.8 2.0. The databases
used in this paper are available from http://sss.stanford.edu/sss/,
and databases derived from the current version of SCOP
may be found at http://scop.mrc-lmb.cam.ac.uk/scop/.
Analyses from both databases were generally consistent, but
PDB40D-B focuses on distantly related proteins and reduces the
heavy overrepresentation in the PDB of a small number of
families (31, 32), whereas PDB90D-B (with more sequences)
improves evaluations of statistics. Except where noted
otherwise, the distant homolog results here are from PDB40D-B.
Although the precise numbers reported here are specific to the
structural domain databases used, we expect the trends to be
general.

Assessment Data and Procedure. Our assessment of
sequence comparison may be divided into four different major
categories of tests. First, using just a single sequence
comparison algorithm at a time, we evaluated the effectiveness of
different scoring schemes. Second, we assessed the reliability
of scoring procedures, including an evaluation of the validity
of statistical scoring. Third, we compared sequence
comparison algorithms (using the optimal scoring scheme) to
determine their relative performance. Fourth, we examined the
distribution of homologs and considered the power of pairwise
sequence comparison to recognize them. All of the analyses
used the databases of structurally identified homologs and a
new assessment criterion.

The analyses tested BLAST (1), version 1.4.9MP, and
WU-BLAST2 (2), version 2.0a13MP. Also assessed was the FASTA
package, version 3.0t76 (3), which provided FASTA and the
SSEARCH implementation of Smith-Waterman (8). For
SSEARCH and FASTA, we used BLOSUM45 with gap penalties
-12/-1 (7, 16). The default parameters and matrix
(BLOSUM62) were used for BLAST and WU-BLAST2.

The "Coverage vs. Error" Plot. To test a particular protocol
(comprising a program and scoring scheme), each sequence
from the database was used as a query to search the database.
This yielded ordered pairs of query and target sequences with
associated scores, which were sorted, on the basis of their
scores, from best to worst. The ideal method would have
perfect separation, with all of the homologs at the top of the
list and unrelated proteins below. In practice, perfect
separation is impossible to achieve, so instead one is interested in
drawing a threshold above which there are the largest number
of related pairs of sequences consistent with an acceptable
error rate.

Fig. 1. Coverage vs. error plots of different scoring schemes for SSEARCH Smith-Waterman. (A) Analysis of PDB40D-B database. (B) Analysis
of PDB90D-B database. All of the proteins in the database were compared with each other using the SSEARCH program. The results of this single
set of comparisons were considered using five different scoring schemes and assessed. The graphs show the coverage and errors per query (EPQ)
for statistical scores, raw scores, and three measures using percentage identity. In the coverage vs. error plot, the x axis indicates the fraction of
all homologs in the database (known from structure) which have been detected. Precisely, it is the number of detected pairs of proteins with the
same fold divided by the total number of pairs from a common superfamily. PDB40D-B contains a total of 9,044 homologs, so a score of 10% indicates
identification of 904 relationships. The y axis reports the number of EPQ. Because there are 1,323 queries made in the PDB40D-B all-vs.-all
comparison, 13 errors correspond to 0.01, or 1% EPQ. The y axis is presented on a log scale to show results over the widely varying degrees of
accuracy which may be desired. The scores that correspond to the levels of EPQ and coverage are shown in Fig. 4 and Table 1. The graph
demonstrates the trade-off between sensitivity and selectivity. As more homologs are found (moving to the right), more errors are made (moving
up). The ideal method would be in the lower right corner of the graph, which corresponds to identifying many evolutionary relationships without
selecting unrelated proteins. Three measures of percentage identity are plotted. Percentage identity within alignment is the degree of identity within
the aligned region of the proteins, without consideration of the alignment length. Percentage identity within both is the number of identical residues
in the aligned region as a percentage of the average length of the query and target proteins. The HSSP equation (17) is $H = 290.15\,l^{-0.562}$, where
$l$ is length, for $10 < l < 80$; $H > 100$ for $l < 10$; $H = 24.7$ for $l > 80$. The percentage identity HSSP-adjusted score is the percent identity within
the alignment minus H. Smith-Waterman raw scores and E-values were taken directly from the sequence comparison program.

Our procedure involved measuring the coverage and error
for every threshold. Coverage was defined as the fraction of
structurally determined homologs that have scores above the
selected threshold; this reflects the sensitivity of a method.
Errors per query (EPQ), an indicator of selectivity, is the
number of nonhomologous pairs above the threshold divided
by the number of queries. Graphs of these data, called
coverage vs. error plots, were devised to understand how
protocols compare at different levels of accuracy. These
graphs share effectively all of the beneficial features of
Receiver Operating Characteristic (ROC) plots (33, 34) but
better represent the high degrees of accuracy required in
sequence comparison and the huge background of
nonhomologs.
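
The two quantities defined above can be computed for every threshold in a single pass over the sorted hit list. The sketch below is illustrative only: `coverage_vs_error` and its arguments are hypothetical names, and scores are assumed to be oriented so that larger means better (E-values, where smaller is better, would need to be negated first).

```python
from typing import Iterable, List, Tuple


def coverage_vs_error(
    hits: Iterable[Tuple[float, bool]],
    total_homologs: int,
    num_queries: int,
) -> List[Tuple[float, float, float]]:
    """Return (threshold score, coverage, EPQ) for each ranked hit.

    hits: (score, is_homolog) for every query/target pair reported.
    Assumption: higher score is better. Tied scores are not merged,
    so only the last point of a tied run is a true threshold.
    """
    ranked = sorted(hits, key=lambda hit: hit[0], reverse=True)
    points = []
    found = 0   # homologous pairs at or above the current threshold
    errors = 0  # nonhomologous pairs at or above the current threshold
    for score, is_homolog in ranked:
        if is_homolog:
            found += 1
        else:
            errors += 1
        points.append((score, found / total_homologs, errors / num_queries))
    return points
```

Plotting coverage on the x axis against EPQ on a logarithmic y axis gives the kind of curve the text describes; with 1,323 queries, 13 errors corresponds to an EPQ of about 0.01.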

This assessment procedure is directly relevant to practical
sequence database searching, for it provides precisely the
information necessary to perform a reliable sequence database
search. The EPQ measure places a premium on score
consistency: that is, it requires scores to be comparable for different
queries. Consistency is an aspect which has been largely

Fig. 2. Unrelated proteins with high percentage identity.
Hemoglobin β-chain (PDB code 1hds chain b, ref. 38, Left) and cellulase E2
(PDB code 1tml, ref. 39, Right) have 39% identity over 64 residues, a
level which is often believed to be indicative of homology. Despite this
high degree of identity, their structures strongly suggest that these
proteins are not related. Appropriately, neither the raw alignment
score of 85 nor the E-value of 1.3 is significant. Proteins rendered by
RASMOL (40).
Fig. 3. Length and percentage identity of alignments of unrelated
proteins in PDB90D-B: Each pair of nonhomologous proteins found with
SSEARCH is plotted as a point whose position indicates the length and
the percentage identity within the alignment. Because alignment
length and percentage identity are quantized, many pairs of proteins
may have exactly the same alignment length and percentage identity.
The line shows the HSSP threshold (though it is intended to be applied
with a different matrix and parameters).
Fig. 4. Reliability of statistical scores in PDB90D-B: Each line shows
the relationship between reported statistical score and actual error
rate for a different program. E-values are reported for SSEARCH and
FASTA, whereas P-values are shown for BLAST and WU-BLAST2. If the
scoring were perfect, then the number of errors per query and the
E-values would be the same, as indicated by the upper bold line.
(P-values should be the same as EPQ for small numbers, and diverge
at higher values, as indicated by the lower bold line.) E-values from
SSEARCH and FASTA are shown to have good agreement with EPQ but
underestimate the significance slightly. BLAST and WU-BLAST2 are
overconfident, with the degree of exaggeration dependent upon the
score. The results for PDB40D-B were similar to those for PDB90D-B
despite the difference in number of homologs detected. This graph
could be used to roughly calibrate the reliability of a given statistical
score.

ignored in previous tests but is essential for the straightforward
or automatic interpretation of sequence comparison results.
Further, it provides a clear indication of the confidence that
should be ascribed to each match. Indeed, the EPQ measure
should approximate the expectation value reported by
database searching programs, if the programs' estimates are
accurate.

The Performance of Scoring Schemes. All of the programs
tested could provide three fundamental types of scores. The
first score is the percentage identity, which may be computed
in several ways based on either the length of the alignment or
the lengths of the sequences. The second is a "raw" or
"Smith-Waterman" score, which is the measure optimized by
the Smith-Waterman algorithm and is computed by summing
the substitution matrix scores for each position in the
alignment and subtracting gap penalties. In BLAST, a measure
related to this score is scaled into bits. Third is a statistical
score based on the extreme value distribution. These results
are summarized in Fig. 1.
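
The raw score described above (substitution scores summed over aligned positions, minus gap penalties) can be sketched for an alignment that has already been computed. Everything named here is hypothetical: TOY_MATRIX is a made-up four-entry table rather than a real substitution matrix, and the affine gap model with an opening cost of 12 and an extension cost of 2 is an assumption, since the text does not state the parameters used.

```python
# Hypothetical toy substitution scores; a real search would use a full matrix.
TOY_MATRIX = {("A", "A"): 4, ("G", "G"): 6, ("L", "L"): 4, ("A", "G"): 0}


def pair_score(matrix, x, y):
    """Look up a symmetric substitution score for residues x and y."""
    if (x, y) in matrix:
        return matrix[(x, y)]
    return matrix[(y, x)]  # KeyError if the pair is absent in both orders


def raw_score(aligned_a, aligned_b, matrix, gap_open=12, gap_extend=2):
    """Score two equal-length aligned strings in which '-' marks a gap.

    The first position of a gap run costs gap_open and each further
    position costs gap_extend. Simplification: adjacent gaps in opposite
    sequences are treated as one run.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    score = 0
    in_gap = False
    for x, y in zip(aligned_a, aligned_b):
        if x == "-" or y == "-":
            score -= gap_extend if in_gap else gap_open
            in_gap = True
        else:
            score += pair_score(matrix, x, y)
            in_gap = False
    return score
```

The sketch only scores a given alignment; the Smith-Waterman algorithm itself searches for the local alignment that maximizes this quantity.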

Sequence Identity. Though it has been long established that
percentage identity is a poor measure (35), there is a common
rule-of-thumb stating that 30% identity signifies homology.
Moreover, publications have indicated that 25% identity can
be used as a threshold (17, 36). We find that these thresholds,
originally derived years ago, are not supported by present
results. As databases have grown, so have the possibilities for
chance alignments with high identity; thus, the reported cutoffs
lead to frequent errors. Fig. 2 shows one of the many pairs of
proteins with very different structures that nonetheless have
high levels of identity over considerable aligned regions.
Despite the high identity, the raw and the statistical scores for
such incorrect matches are typically not significant. The
principal reasons percentage identity does so poorly seem to be
that it ignores information about gaps and about the
conservative or radical nature of residue substitutions.

From the PDB90D-B analysis in Fig. 3, we learn that 30%
identity is a reliable threshold for this database only for
sequence alignments of at least 150 residues. Because one
unrelated pair of proteins has 43.3% identity over 62 residues,
it is probably necessary for alignments to be at least 70 residues
in length before 40% is a reasonable threshold, for a database
of this particular size and composition.

At a given reliability, scores based on percentage identity
detect just a fraction of the distant homologs found by
statistical scoring. If one measures the percentage identity in
the aligned regions without consideration of alignment length,
then a negligible number of distant homologs are detected.
Use of the HSSP equation improves the value of percentage
identity, but even this measure can find only 4% of all known
homologs at 1% EPQ. In short, percentage identity discards
most of the information measured in a sequence comparison.

Raw Scores. Smith-Waterman raw scores perform better
than percentage identity (Fig. 1), but ln-scaling (7) provided no
notable benefit in our analysis. It is necessary to be very precise
when using either raw or bit scores because a 20% change in
cutoff score could yield a tenfold difference in EPQ. However,
it is difficult to choose appropriate thresholds because the
reliability of a bit score depends on the lengths of the proteins
matched and the size of the database. Raw score thresholds
also are affected by matrix and gap parameters.

Statistical Scores. Statistical scores were introduced partly 
to overcome the problems that arise from raw scores. This 
scoring scheme provides the best discrimination between 
homologous proteins and those which are unrelated. Most

Fig. 5. Coverage vs. error plots of different sequence comparison methods: Five different sequence comparison methods are evaluated, each
using statistical scores (E- or P-values). (A) PDB40D-B database. In this analysis, the best method is the slow SSEARCH, which finds 18% of relationships
at 1% EPQ. FASTA ktup = 1 and WU-BLAST2 are almost as good. (B) PDB90D-B database. The quick WU-BLAST2 program provides the best coverage
at 1% EPQ on this database, although at higher levels of error it becomes slightly worse than FASTA ktup = 1 and SSEARCH.
likely, its power can be attributed to its incorporation of more
information than any other measure; it takes account of the
full substitution and gap data (like raw scores) but also has
details about the sequence lengths and composition and is
scaled appropriately.

We find that statistical scores are not only powerful but also
easy to interpret. SSEARCH and FASTA show close agreement
between statistical scores and actual number of errors per
query (Fig. 4). The expectation value score gives a good,
slightly conservative estimate of the chances of the two
sequences being found at random in a given query. Thus, an
E-value of 0.01 indicates that roughly one pair of nonhomologs
of this similarity should be found in every 100 different queries.
Neither raw scores nor percentage identity can be interpreted
in this way, and these results validate the suitability of the
extreme value distribution for describing the scores from a
database search.
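
The interpretation of an E-value as an expected count can be illustrated with a short sketch. The function names are hypothetical, and the conversion P = 1 - exp(-E) is the standard Poisson relation between an expected number of chance matches and the probability of at least one; it is supplied here as an assumption rather than taken from the text.

```python
import math


def expected_false_pairs(e_value, num_queries):
    """Expected number of nonhomologous pairs at or beyond this E-value."""
    return e_value * num_queries


def p_value_from_e_value(e_value):
    """Probability of at least one chance match, assuming Poisson counts."""
    return -math.expm1(-e_value)  # equals 1 - exp(-E), accurate for small E
```

With these definitions, expected_false_pairs(0.01, 100) is about one pair, matching the reading of an E-value of 0.01 given above. For small values the P-value and the E-value nearly coincide, while for large E-values the P-value saturates near 1.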

The P-values from BLAST also should be directly
interpretable but were found to overstate significance by more than two
orders of magnitude for 1% EPQ for this database.
Nonetheless, these results strongly suggest that the analytic theory is
fundamentally appropriate. WU-BLAST2 scores were more
reliable than those from BLAST, but also exaggerate expected
confidence by more than an order of magnitude at 1% EPQ.

Overall Detection of Homologs and Comparison of
Algorithms. The results in Fig. 5A and Table 1 show that pairwise
sequence comparison is capable of identifying only a small
fraction of the homologous pairs of sequences in PDB40D-B.
Even SSEARCH with E-values, the best protocol tested, could
find only 18% of all relationships at a 1% EPQ. BLAST, which
identifies 15%, was the worst performer, whereas FASTA
ktup = 1 is nearly as effective as SSEARCH. FASTA ktup = 2 and
WU-BLAST2 are intermediate in their ability to detect
homologs. Comparison of different algorithms indicates that
those capable of identifying more homologs are generally
slower. SSEARCH is 25 times slower than BLAST and 6.5 times
slower than FASTA ktup = 1. WU-BLAST2 is slightly faster than
FASTA ktup = 2, but the latter has more interpretable scores.

In PDB90D-B, where there are many close relationships, the
best method can identify only 38% of structurally known
homologs (Fig. 5B). The method which finds that many
relationships is WU-BLAST2. Consequently, we infer that the
differences between FASTA ktup = 1, SSEARCH, and WU-BLAST2
programs are unlikely to be significant when compared with
variation in database composition and scoring reliability.

Fig. 6 helps to explain why most distant homologs cannot be
found by sequence comparison: a great many such
relationships have no more sequence identity than would be expected
by chance. SSEARCH with E-values can recognize >90% of the
homologous pairs with 30-40% identity. In this region, there
are 30 pairs of homologous proteins that do not have
significant E-values, but 26 of these involve sequences with <50
residues. Of sequences having 25-30% identity, 75% are
identified by SSEARCH E-values. However, although the
number of homologs grows at lower levels of identity, the detection
falls off sharply: only 40% of homologs with 20-25% identity
Fig. 6. Distribution and detection of homologs in PDB40D-B. Bars
show the distribution of homologous pairs in PDB40D-B according to their
identity (using the measure of identity in both). Filled regions indicate
the number of these pairs found by the best database searching method
(SSEARCH with E-values) at 1% EPQ. The PDB40D-B database contains
proteins with <40% identity, and as shown on this graph, most
structurally identified homologs in the database have diverged
extremely far in sequence and have <20% identity. Note that the
alignments may be inaccurate, especially at low levels of identity. Filled
regions show that SSEARCH can identify most relationships that have
25% or more identity, but its detection wanes sharply below 25%.
Consequently, the great sequence divergence of most structurally
identified evolutionary relationships effectively defeats the ability of
pairwise sequence comparison to detect them.

are detected and only 10% of those with 15-20% can be found.
These results show that statistical scores can find related
proteins whose identity is remarkably low; however, the power
of the method is restricted by the great divergence of many
protein sequences.

After completion of this work, a new version of pairwise
BLAST was released: BLASTGP (37). It supports gapped
alignments, like WU-BLAST2, and dispenses with sum statistics. Our
initial tests on BLASTGP using default parameters show that its
E-values are reliable and that its overall detection of homologs
was substantially better than that of ungapped BLAST, but not
quite equal to that of WU-BLAST2.

CONCLUSION 

The general consensus amongst experts (see refs. 7, 2*1, 25, 21
and references therein) suggests that the most effective
sequence searches are made by (i) using a large current database
in which the protein sequences have been complexity masked
and (ii) using statistical scores to interpret the results. Our
experiments fully support this view.

Our results also suggest two further points. First, the
E-values reported by FASTA and SSEARCH give fairly accurate
estimates of the significance of each match, but the P-values
provided by BLAST and WU-BLAST2 underestimate the true



Table 1. Summary of sequence comparison methods with PDB40D-B

Method                                  Relative Time*   1% EPQ Cutoff       Coverage at 1% EPQ
SSEARCH % identity: within alignment    25.5             [illegible]         <0.1
SSEARCH % identity: within both         25.5             [illegible]         3.0
SSEARCH % identity: HSSP-scaled         25.5             35% (HSSP + 9.8)    4.0
SSEARCH Smith-Waterman raw scores       25.5             [illegible]         10.5
SSEARCH E-values                        25.5             0.03                18.4
FASTA ktup = 1 E-values                 3.9              0.03                17.[illegible]
FASTA ktup = 2 E-values                 1.4              0.03                [illegible]
WU-BLAST2 P-values                      1.1              0.003               17.5
BLAST P-values                          1.0              0.00016             14.8

*Times are from large database searches with genome proteins.
extent of errors. Second, SSEARCH, WU-BLAST2 and FASTA
ktup = 1 perform best, though BLAST and FASTA ktup = 2
detect most of the relationships found by the best procedures
and are appropriate for rapid initial searches.

The homologous proteins that are found by sequence
comparison can be distinguished with high reliability from the huge
number of unrelated pairs. However, even the best database
searching procedures tested fail to find the large majority of
distant evolutionary relationships at an acceptable error rate.
Thus, if the procedures assessed here fail to find a reliable
match, it does not imply that the sequence is unique; rather, it
indicates that any relatives it might have are distant ones.**



** Additional and updated information about this work, including
supplementary figures, may be found at http://sss.stanford.edu/sss/.